Claude 4
Future You: Designing and Evaluating Multimodal AI-generated Digital Twins for Strengthening Future Self-Continuity
Albrecht, Constanze, Archiwaranguprok, Chayapatr, Poonsiriwong, Rachel, Chen, Awu, Yin, Peggy, Lertsutthiwong, Monchai, Winson, Kavin, Hershfield, Hal, Maes, Pattie, Pataranutaporn, Pat
What if users could meet their future selves today? AI-generated future selves simulate meaningful encounters with a digital twin decades in the future. As AI systems advance, combining cloned voices, age-progressed facial rendering, and autobiographical narratives, a central question emerges: Does the modality of these future selves alter their psychological and affective impact? How might a text-based chatbot, a voice-only system, or a photorealistic avatar shape present-day decisions and our feeling of connection to the future? We report a randomized controlled study (N=92) evaluating three modalities of AI-generated future selves (text, voice, avatar) against a neutral control condition. We also report a systematic comparison of Claude 4 against three other Large Language Models (LLMs), assessing Claude 4 across psychological and interaction dimensions and establishing conversational AI quality as a critical determinant of intervention effectiveness. All personalized modalities strengthened Future Self-Continuity (FSC), emotional well-being, and motivation compared to the control, with the avatar producing the largest vividness gains, yet with no significant differences between formats. Interaction quality metrics, particularly persuasiveness, realism, and user engagement, emerged as robust predictors of psychological and affective outcomes, indicating that how compelling the interaction feels matters more than the form it takes. Content analysis revealed distinct thematic patterns: text emphasized career planning, while voice and avatar facilitated personal reflection. Claude 4 outperformed ChatGPT 3.5, Llama 4, and Qwen 3 in enhancing psychological, affective, and FSC outcomes.
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Xiong, Lang, Bhargava, Nishant, Hong, Jianhang, Chang, Jeremy, Liu, Haihao, Sharma, Vasu, Zhu, Kevin
Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model's true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from "test-like" to "deploy-like" and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten "deploy-like" prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
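As a concrete illustration of the probe-and-rewrite workflow, here is a minimal sketch assuming the probe is a logistic-regression direction over hidden activations; `embed` and `rewrite` are hypothetical stand-ins for the activation extractor and the LLM rewriting step, not the paper's actual code.

```python
# Minimal sketch of a linear probe that scores prompts on a continuous
# test-like (0) to deploy-like (1) axis, plus a rewrite-and-rescore step.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_probe(acts, labels):
    """acts: (n, d) hidden states; labels: 1 = deploy-like, 0 = test-like."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(acts, labels)
    return probe

def deploy_score(probe, act):
    """Continuous score in [0, 1]; higher = more deploy-like."""
    return float(probe.predict_proba(act.reshape(1, -1))[0, 1])

def rewrite_toward_deployment(prompt, probe, embed, rewrite, min_gain=0.05):
    """Keep an LLM rewrite only if it raises the probe score while the
    rewriting prompt (not shown) is charged with preserving the task."""
    base = deploy_score(probe, embed(prompt))
    candidate = rewrite(prompt)  # LLM rewriting step (assumed API)
    return candidate if deploy_score(probe, embed(candidate)) - base >= min_gain else prompt
```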
Quantitative Analysis of Technical Debt and Pattern Violation in Large Language Model Architectures
As Large Language Models (LLMs) transition from code completion tools to autonomous system architects, their impact on long-term software maintainability remains unquantified. While existing research benchmarks functional correctness (pass@k), this study presents the first empirical framework to measure "Architectural Erosion" and the accumulation of Technical Debt in AI-synthesized microservices. We conducted a comparative pilot study of three state-of-the-art models (GPT-5.1, Claude 4.5 Sonnet, and Llama 3 8B) by prompting them to implement a standardized Book Lending Microservice under strict Hexagonal Architecture constraints. Utilizing Abstract Syntax Tree (AST) parsing, we find that while proprietary models achieve high architectural conformance (0% violation rate for GPT-5.1), open-weights models exhibit critical divergence. Specifically, Llama 3 demonstrated an 80% Architectural Violation Rate, frequently bypassing interface adapters to create illegal circular dependencies between Domain and Infrastructure layers. Furthermore, we identified a phenomenon of "Implementation Laziness," where open-weights models generated 60% fewer Logical Lines of Code (LLOC) than their proprietary counterparts, effectively omitting complex business logic to satisfy token constraints. These findings suggest that without automated architectural linting, utilizing smaller open-weights models for system scaffolding accelerates the accumulation of structural technical debt.
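To make the violation metric concrete, here is a minimal sketch of an AST-based layer check under stated assumptions (Python sources, a `domain/` package that must not import from `infrastructure/`); the study's actual tooling may differ.

```python
# Sketch of an AST check for one violation class named in the abstract:
# domain code importing infrastructure directly, bypassing the
# ports/adapters boundary of a hexagonal architecture.
import ast
import pathlib

FORBIDDEN_PREFIX = "infrastructure"  # the domain layer must not depend on this

def violations_in_file(path: pathlib.Path) -> list[str]:
    tree = ast.parse(path.read_text())
    bad = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        bad += [n for n in names if n.startswith(FORBIDDEN_PREFIX)]
    return bad

def architectural_violation_rate(repo: pathlib.Path) -> float:
    """Fraction of domain-layer files with at least one illegal import."""
    files = list((repo / "domain").rglob("*.py"))
    dirty = sum(1 for f in files if violations_in_file(f))
    return dirty / len(files) if files else 0.0
```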
Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding
Khan, Ishraq, Chowdary, Assad, Haseeb, Sharoz, Patel, Urvish, Zaii, Yousuf
Large Language Models (LLMs) have advanced code generation and software automation but remain constrained by inference-time context and lack structured reasoning over code, leaving debugging largely unsolved. While Claude 4.5 Opus achieves 74.40% on SWE-bench Verified and Gemini 3 Pro reaches 76.2%, both models remain below 20% on real multi-file debugging tasks. We introduce Kodezi Chronos-1, a language model purpose-built for debugging that integrates Adaptive Graph-Guided Retrieval to navigate codebases up to 10 million lines (92% precision, 85% recall), Persistent Debug Memory trained on over 15 million sessions, and a seven-layer fix-test-refine architecture. On 5,000 real-world scenarios, Chronos-1 achieves 67.3% +/- 2.1% fix accuracy compared to 14.2% +/- 1.3% for Claude 4.1 Opus and 13.8% +/- 1.2% for GPT-4.1 (Cohen's d = 3.87). On SWE-bench Lite, Chronos-1 reaches a state-of-the-art 80.33% resolution rate (241 of 300), outperforming the next best system by 20 points and achieving repository-specific highs of 96.1% on Sympy and 90.4% on Django. Chronos-1 reduces debugging time by 40% and iterations by 65%, resolving complex multi-file and cross-repository bugs that require temporal analysis. Limitations remain for hardware-dependent and dynamic language errors, and Chronos-1 will be available in Kodezi OS in Q4 2025 and via API in Q1 2026.
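The fix-test-refine idea can be sketched generically; the snippet below is our reading of such a loop, not Chronos-1's proprietary seven-layer architecture, and `propose_fix`, `apply_patch`, and `run_tests` are hypothetical stand-ins.

```python
# Generic fix-test-refine loop in the spirit of the pipeline described:
# propose a patch, validate it against tests, and feed failures back.
def fix_test_refine(bug_report, codebase, propose_fix, apply_patch,
                    run_tests, max_iters=7):
    feedback = bug_report
    for _ in range(max_iters):
        patch = propose_fix(codebase, feedback)  # model proposes a candidate fix
        candidate = apply_patch(codebase, patch)
        result = run_tests(candidate)            # validate against the test suite
        if result.passed:
            return patch                         # fix confirmed by tests
        feedback = result.failures               # refine using the failures
    return None                                  # unresolved within budget
```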
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Zhao, Songwen, Wang, Danqing, Zhang, Kexun, Luo, Jiaxuan, Li, Zhuo, Li, Lei
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe coding, particularly in security-sensitive applications.
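A minimal sketch of the two-axis scoring implied by these numbers, with hypothetical `passes_tests` and `passes_security_audit` predicates standing in for the benchmark's actual functional and security checks:

```python
# Score an agent's solutions on two independent axes: functional
# correctness and security. A solution can pass tests yet be insecure.
def score_agent(solutions, passes_tests, passes_security_audit):
    n = len(solutions)
    functional = sum(1 for s in solutions if passes_tests(s))
    secure = sum(1 for s in solutions if passes_security_audit(s))
    return {
        "functional_rate": functional / n,  # e.g. 0.61 reported for SWE-Agent + Claude 4 Sonnet
        "secure_rate": secure / n,          # e.g. 0.105 reported
    }
```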
Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting
Pyeon, Goun, Heo, Inbum, Jung, Jeesu, Hwang, Taewook, Namgoong, Hyuk, Seo, Hyein, Han, Yerim, Kim, Eunbin, Kang, Hyeonseok, Jung, Sangkeun
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (Text-only, Image-only, Text+Figure) and prompt languages (Korean, English). The GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations, while Grok 4, Qwen 3 235B, and Gemini 2.5 Pro also scored above 97 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed Calculus as the weakest domain, with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with the GPT-5 series, increased reasoning intensity improved performance (82.6 -> 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a standardized digitization pipeline that converts human-targeted exam materials into LLM-ready evaluation data, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard: https://isoft.cnu.ac.kr/csat2026/
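A back-of-envelope check of the efficiency claim, using only the reported scores and the 4x token ratio (absolute token counts are hypothetical):

```python
# Efficiency comparison for the GPT-5 reasoning-intensity experiment:
# score rose 82.6 -> 100 while token usage quadrupled. Token budgets are
# expressed in relative units; only the scores and 4x ratio come from
# the abstract.
low = {"score": 82.6, "tokens": 1.0}    # minimal reasoning
high = {"score": 100.0, "tokens": 4.0}  # quadrupled token usage

eff_low = low["score"] / low["tokens"]     # 82.6 points per token unit
eff_high = high["score"] / high["tokens"]  # 25.0 points per token unit
print(f"efficiency drop: {eff_low / eff_high:.1f}x")  # ~3.3x less efficient
```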
Towards Trustworthy Legal AI through LLM Agents and Formal Reasoning
Chen, Linze, Cai, Yufan, Hou, Zhe, Dong, Jinsong
The rationality of law manifests in two forms: substantive rationality, which concerns the fairness or moral desirability of outcomes, and formal rationality, which requires legal decisions to follow explicitly stated, general, and logically coherent rules. Existing LLM-based systems excel at surface-level text analysis but lack the guarantees required for principled jurisprudence. We introduce L4M, a novel framework that combines adversarial LLM agents with SMT-solver-backed proofs to unite the interpretive flexibility of natural language with the rigor of symbolic verification. The pipeline consists of three phases: (1) Statute Formalization, where domain-specific prompts convert legal provisions into logical formulae; (2) Dual Fact and Statute Extraction, in which prosecutor- and defense-aligned LLMs independently map case narratives to fact tuples and statutes, ensuring role isolation; and (3) Solver-Centric Adjudication, where an autoformalizer compiles both parties' arguments into logic constraints, and unsat cores trigger iterative self-critique until a satisfiable formula is achieved, which is then verbalized by a Judge-LLM into a transparent verdict and optimized sentence. Experimental results on public benchmarks show that our system surpasses advanced LLMs including GPT-o4-mini, DeepSeek-V3, and Claude 4 as well as state-of-the-art Legal AI baselines, while providing rigorous and explainable symbolic justifications.
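To illustrate the solver-centric step, here is a minimal Z3 sketch with a toy statute and facts (invented for illustration, not from the paper): tracked assertions from both parties are checked, and the unsat core names the conflicting premises that would trigger the self-critique loop.

```python
# Toy illustration of solver-centric adjudication with unsat cores.
from z3 import Bool, Implies, Not, Solver, unsat

breach = Bool("breach_of_contract")
paid = Bool("payment_made")
liable = Bool("defendant_liable")

s = Solver()
# Formalized statutes (phase 1 in the pipeline)
s.assert_and_track(Implies(breach, liable), "statute_liability")
s.assert_and_track(Implies(paid, Not(liable)), "statute_discharge")
# Role-isolated fact extraction (phase 2): prosecutor vs. defense
s.assert_and_track(breach, "prosecutor_fact_breach")
s.assert_and_track(paid, "defense_fact_payment")

if s.check() == unsat:
    # The unsat core names the conflicting premises; the framework would
    # feed these back for iterative self-critique until satisfiable.
    print("conflict among:", s.unsat_core())
```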
More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents
LLM-powered coding agents, which operate in iterative loops (turns) to solve software engineering tasks, are becoming increasingly powerful. However, their practical deployment is hindered by significant and unpredictable costs. This challenge arises from a combination of factors: quadratically growing token counts with each turn, the high price of models, the large number of turns required for real-world tasks, and the tendency of agents to take inefficient or unnecessary actions. While existing research focuses on optimizing individual turns, the strategic control of the total number of turns remains an underexplored area for managing agent performance and cost. To address this gap, we conduct a comprehensive empirical study on SWE-bench using three state-of-the-art models and evaluate the impact of three distinct turn-control strategies: an unrestricted baseline, a fixed-turn limit with reminders, and a novel dynamic-turn strategy that grants extensions on-demand. Our findings first reveal a fundamental trade-off in the unrestricted setting, where no single model excels across performance, cost, and turn efficiency. We then show that a fixed-turn limit, specifically at the 75th percentile of the baseline, serves as a "sweet spot", substantially reducing costs (by 24%-68%) with minimal impact on solve rates. Most significantly, the dynamic-turn strategy consistently outperforms fixed-limit approaches, achieving comparable or better solve rates while further reducing costs by an additional 12%-24% by intelligently allocating resources only to tasks that need them. This work provides the first systematic analysis of turn-control strategies, offering simple yet effective guidelines for developers to balance cost and efficacy. We demonstrate that dynamic resource allocation is a superior, easy-to-implement approach for deploying powerful yet economically viable coding agents.
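A minimal sketch of the two turn-control policies, assuming a hypothetical agent API; the extension heuristic shown (grant extra turns only on request, up to a cap) is one plausible reading of the dynamic strategy, not the paper's exact rule.

```python
# Fixed-limit and dynamic turn-control strategies for a coding agent.
import numpy as np

def fixed_limit(baseline_turns):
    """Fixed cap at the 75th percentile of unrestricted baseline runs,
    the 'sweet spot' identified in the study."""
    return int(np.percentile(baseline_turns, 75))

def run_with_dynamic_turns(agent, task, base_budget, extension=5, max_ext=2):
    """Grant small on-demand extensions instead of a hard cutoff.
    `agent.start`/`agent.step` and the state attributes are assumed."""
    turns, budget, granted = 0, base_budget, 0
    state = agent.start(task)
    while not state.solved and turns < budget:
        state = agent.step(state)
        turns += 1
        # Allocate extra resources only to tasks that ask for them.
        if turns == budget and state.requests_extension and granted < max_ext:
            budget += extension
            granted += 1
    return state
```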
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Xia, Chunqiu Steven, Wang, Zhe, Yang, Yan, Wei, Yuxiang, Zhang, Lingming
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories that solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have recently proposed a number of self-improving software agents, including the Darwin-Gödel Machine (DGM). However, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on the fly at runtime while solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with access only to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-bench Pro benchmark, achieving the best-known solve rate of 45.8%.
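A highly simplified sketch of on-the-fly scaffold evolution as we read the abstract (not Live-SWE-agent's actual implementation; `llm` and `bash` are assumed primitives, and the `tools=` signature is hypothetical):

```python
# The agent solves tasks with its current scaffold and may rewrite its
# own scaffold source between tasks, so later tasks benefit from the
# revision. Starts from a minimal bash-only scaffold, per the abstract.
def live_agent(tasks, llm, bash, scaffold_path="scaffold.py"):
    for task in tasks:
        scaffold = open(scaffold_path).read()
        # Solve the task using the current scaffold.
        trajectory = llm(f"Scaffold:\n{scaffold}\n\nSolve:\n{task}",
                         tools=[bash])
        # Ask the model whether the scaffold itself should be improved,
        # and persist any revision for subsequent tasks.
        revised = llm("Improve this scaffold given the trajectory:\n"
                      f"{scaffold}\n{trajectory}")
        if revised.strip():
            open(scaffold_path, "w").write(revised)
```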
Arctic-Extract Technical Report
Chiliński, Mateusz, Ołtusek, Julita, Jaśkowski, Wojciech
Arctic-Extract is a state-of-the-art model designed for extracting structured data (question answering, entities, and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model weighs only 6.6 GiB, making it deployable on resource-constrained hardware such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on such GPUs, making it suitable for long-document processing. This paper presents Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.